The first dataset we will analyze contains information about the businesses. The businesses are stored in the file yelp_academic_dataset_business.json. There are 15,585 businesses.
Each business contains 15 fields which are explained below:
type
: Contains the type of the data. All rows in the yelp_academic_dataset_business.json file have the value business in the type field.
business_id
: Contains an encrypted business id.
name
: The name of the business.
neighborhoods
: Supposed to list the neighborhoods the business belongs to, but for all rows this field contains an empty array.
full_address
: The local address of the business.
city
: The city where the business is located.
state
: The state where the business is located. Of the 15,585 businesses, only 3 are not in Arizona.
latitude
: The latitude of the business.
longitude
: The longitude of the business.
stars
: The star rating that users have given to this business. The rating is rounded to half-stars.
review_count
: The number of reviews this business has received.
categories
: An array with the categories to which this business belongs. For example, [Restaurant, Bar, Mexican].
open
: Presumably indicates whether the business is still operating.
hours
: A dictionary with the opening hours of the business for each day of the week.
attributes
: A dictionary with additional information about the business. For example 'Accepts credit cards', 'Delivery', 'Price range', 'Parking', etc.
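To get a feel for this structure, one can peek at the keys of the first record before loading everything. This is a minimal sketch; it assumes the file sits in the working directory and holds one JSON object per line, as the Yelp academic dataset dumps do.

import json

# Read only the first line and inspect which fields a business record contains.
with open('yelp_academic_dataset_business.json') as f:
    first_business = json.loads(f.readline())

print(sorted(first_business.keys()))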
In [1]:
import json
import matplotlib.pyplot as plt
from pandas import DataFrame

business_file = 'yelp_academic_dataset_business.json'
# Each line of the file holds one JSON object describing a business.
business_records = [json.loads(line) for line in open(business_file)]
business_data_frame = DataFrame(business_records)

column = 'stars'
# Count how many businesses fall into each star-rating bucket.
business_counts = business_data_frame.groupby(column).size()
# print(business_counts)
In [2]:
business_counts.plot(kind='bar', rot=0)
Out[2]:
To speed things up, I have created a couple of functions that automatically plot the data when you pass them the JSON file name.
In [3]:
def plot_json_file(file_path, column, plot_type='line', title=None,
                   x_label=None, y_label=None, show_total=True,
                   show_range=False, y_scale='linear'):
    """
    Creates a DataFrame object from a JSON file and plots the data,
    including the values for the mean, median and standard deviation and,
    if requested, the sum of all the values and a range with the minimum
    and maximum values.

    @param file_path: the absolute path of the JSON file that contains the
        data
    @param column: the column which will be used to group and count the data
    @param plot_type: the type of graph. For example 'bar', 'barh', 'line',
        etc.
    @param title: the title of the graph
    @param x_label: the label for the x axis
    @param y_label: the label for the y axis
    @param show_total: a boolean which indicates if the sum of all the
        values should be displayed on the graph
    @param show_range: a boolean which indicates if the minimum and maximum
        values should be displayed on the graph
    @param y_scale: the scale of the y axis, e.g. 'linear' or 'log'
    """
    # Each line of the file holds one JSON record.
    records = [json.loads(line) for line in open(file_path)]
    data_frame = DataFrame(records)
    plot_data(data_frame, column, plot_type, title,
              x_label, y_label, show_total, show_range, y_scale)
def plot_data(data_frame, column, plot_type='line', title=None,
              x_label=None, y_label=None, show_total=True,
              show_range=False, y_scale='linear'):
    # Count how many rows fall into each value of the chosen column.
    counts = data_frame.groupby(column).size()
    mean = data_frame.mean()[column]
    std = data_frame.std()[column]
    median = data_frame.median()[column]
    label = 'mean=' + str(mean) + '\nmedian=' + str(
        median) + '\nstd=' + str(std)
    if show_total:
        total = data_frame.sum()[column]
        label = label + '\ntotal=' + str(total)
    if show_range:
        min_value = data_frame.min()[column]
        max_value = data_frame.max()[column]
        label = label + '\nrange=[' + str(min_value) + ', ' + str(
            max_value) + ']'
    fig, ax = plt.subplots(1)
    counts.plot(kind=plot_type, rot=0)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.set_yscale(y_scale)
    # These are matplotlib.patch.Patch properties for the stats box
    properties = dict(boxstyle='round', facecolor='wheat', alpha=0.95)
    ax.text(0.05, 0.95, label, fontsize=14, transform=ax.transAxes,
            verticalalignment='top', bbox=properties)
If we call this function, we can obtain the same results (or even better ones!).
In [4]:
plot_json_file('yelp_academic_dataset_business.json', 'stars', 'bar',
'Businesses\' ratings', 'Rating', 'Number of places', False)
As can be seen in the graph above, most businesses have an average rating of either 3.5 or 4 stars. Very few businesses have bad ratings.
Now we will continue to analyze the businesses' data but this time looking at the number of reviews that each business has.
In [5]:
plot_json_file(business_file, 'review_count', 'line', 'Reviews per business',
               'Review count', 'Frequency', True, True, 'log')
We can see that the great majority of businesses have very few reviews (as shown by the median). On average, each business has around 23 reviews. The business with the fewest reviews has 3, and the business with the most reviews has 1,170.
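The same summary numbers can also be read off directly with pandas, without plotting. A small sketch reusing the business_data_frame built in the first cell:

# min, median, mean and max of the per-business review counts; these
# should match the values quoted above.
print(business_data_frame['review_count'].describe())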
This is the biggest dataset given in the Yelp data challenge; the file is yelp_academic_dataset_review.json. There are 335,022 reviews.
Each review contains 7 fields which are explained below:
type
: Contains the type of the data. All rows in the yelp_academic_dataset_review.json file have the value review in the type field.
business_id
: Contains an encrypted business id.
user_id
: Contains an encrypted user id.
stars
: The rating awarded in the review.
text
: The text of the review.
date
: The date of the review in 'yyyy-mm-dd' format.
votes
: A dictionary with the votes that the review has received. There are three categories for the votes: 'cool', 'funny' and 'useful'.
As can be seen in the image above, the majority of user ratings are 4 and 5 stars, and there is a huge gap between 3-star and 4-star ratings. One could say one of two things: either users usually don't rate bad places, or there are fewer bad places than good places.
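The rating distribution described above can presumably be reproduced with the same helper used for the businesses. A sketch, with title and axis labels of my own choosing, assuming the reviews file follows the same one-object-per-line layout:

# Bar chart of how many reviews were given for each star rating.
plot_json_file('yelp_academic_dataset_review.json', 'stars', 'bar',
               'Review ratings', 'Rating', 'Number of reviews', False)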
This dataset contains information about 70,817 users, which are stored in the yelp_academic_dataset_user.json file.
Each user contains 11 fields which are explained below:
type
: Contains the type of the data. All rows in the yelp_academic_dataset_user.json file have the value user in the type field.
user_id
: Contains an encrypted user id.
name
: Contains the first name of the user.
review_count
: The number of reviews this user has made.
average_stars
: The average rating of this user.
votes
: A dictionary with the number of votes this user has made, for each type of vote. There are three categories for the votes: 'cool', 'funny' and 'useful'.
friends
: A list with the user's friends.
elite
: A list with the years that this user has been elite. 93% of the users have an empty list in this field.
yelping_since
: The date this user joined Yelp in 'yyyy-mm-dd' format.
compliments
: A dictionary with the number of votes this user has received, for each type of vote.
fans
: The number of fans this user has.
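The 93% figure for the elite field can be checked directly from the file. A sketch, assuming the user file also uses the one-JSON-object-per-line layout:

import json
from pandas import DataFrame

users = DataFrame([json.loads(line) for line in open('yelp_academic_dataset_user.json')])
# Fraction of users whose 'elite' list is empty.
non_elite_share = (users['elite'].apply(len) == 0).mean()
print('Users with an empty elite list: {:.0%}'.format(non_elite_share))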
In [6]:
plot_json_file('yelp_academic_dataset_user.json', 'review_count', 'line', 'Reviews per user',
'Review count', 'Frequency', True, True, 'log')
As can be seen in the graph above, the review counts in the user dataset don't seem to match the reviews dataset. This is probably because the users in the user dataset have also written reviews in other places, not just in Phoenix, while the reviews dataset only contains reviews that were made for businesses in Phoenix.
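One way to quantify the mismatch is to compare the total review_count claimed in the user profiles with the number of review records actually present. A sketch reusing the users frame from the previous snippet; the variable names are illustrative:

total_claimed = users['review_count'].sum()
# Count the review records without loading them all into memory.
reviews_in_dataset = sum(1 for _ in open('yelp_academic_dataset_review.json'))
print('Reviews claimed by user profiles:', total_claimed)
print('Review records in the dataset:', reviews_in_dataset)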
This dataset contains 11,434 check-in records (one per business), which are stored in the yelp_academic_dataset_checkin.json file.
Each check-in contains 3 fields which are explained below:
type
: Contains the type of the data. All rows in the yelp_academic_dataset_checkin.json file have the value checkin in the type field.
business_id
: Contains an encrypted business id.
checkin_info
: Contains a dictionary with the number of check-ins for each hour of each day of the week.
There are a total of 1,457,303 check-ins for 11,434 businesses, which gives an average of about 127 check-ins per business, or about 18 check-ins per business per day of the week.
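Those totals could be recomputed by summing the checkin_info dictionaries. A sketch, again assuming one JSON object per line:

import json

checkin_records = [json.loads(line) for line in open('yelp_academic_dataset_checkin.json')]
# Sum the per-hour counts across all businesses.
total_checkins = sum(sum(record['checkin_info'].values()) for record in checkin_records)
print('Businesses with check-ins:', len(checkin_records))
print('Total check-ins:', total_checkins)
print('Average check-ins per business:', total_checkins / float(len(checkin_records)))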
This dataset contains information about 113,993 tips, which are stored in the yelp_academic_dataset_tip.json file. A tip seems very similar to a review without a rating.
Each tip contains 6 fields which are explained below:
type
: Contains the type of the data. All rows in the yelp_academic_dataset_tip.json file have the value tip in the type field.
text
: The text of the tip.
business_id
: Contains an encrypted business id.
user_id
: Contains an encrypted user id.
date
: The date of the tip in 'yyyy-mm-dd' format.
likes
: The number of likes this tip has received.
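As a final look at this dataset, the same helper could be applied to the likes field, for example to see how many likes tips usually receive. A sketch; the title and labels are illustrative choices:

# Frequency of tips by number of likes, on a logarithmic y axis.
plot_json_file('yelp_academic_dataset_tip.json', 'likes', 'line',
               'Likes per tip', 'Number of likes', 'Frequency',
               False, True, 'log')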